Unravelling Graph-Exchange File Formats

Matthew Roughan and Jonathan Tuke 


Abstract: A graph is used to represent data in which the
relationships between the objects in the data are at least as
important as the objects themselves. Over the last two decades
nearly a hundred file formats have been proposed or used
to provide portable access to such data. This paper seeks to
review these formats, and provide some insight to both reduce
the ongoing creation of unnecessary formats, and guide the
development of new formats where needed.

I. Introduction

Exchange of data is a basic requirement of scientific
research. Accurate exchange requires portable file formats,
where portability means the ability to transfer (without
extraordinary efforts) the data both between computers (hardware
and operating system), and between software (different
graph manipulation and analysis packages).

A short search of the Internet revealed that there are well 
over 70 formats used for exchange of graph data: that is 
networks of vertices (nodes, switches, routers, ...) connected 
by edges (links, arcs, ...). 

It seems that every new tool for working with graphs derives 
its own new graph format. There are reasons for this: new tools 
are often aimed at providing a new capability. Sometimes this 
capability is not supported by existing formats. And inventing 
your own new format isn’t hard. 

More fundamentally, exchange of graph information just 
hasn’t been that important. Standardised formats for images 
(and other consumer data) are crucial for the functioning 
of digital society. Standardised graph formats affect a small 
community of researchers and tool builders. This community 
is growing, however, and the need for exchange of information 
is likewise growing, particularly where the data represent some 
real measurements that were expensive to collect. 

The tendency to create new formats in preference to using 
existing tools is unhelpful though, particularly as the time to 
“create” a format might be small, but the time to carefully test 
formats and read/write implementations is extensive. Reliable 
code is critical to maintain data quality, but many tool developers
seem to focus on features instead of well-audited code.
Moreover support of formats, for instance clear documentation 
and ongoing bug fixes, is often lacking. 

An explosion of formats is therefore a poor state of affairs. 
The existing formats do include many of the features one 
might need, and some are quite extensible, so the bottleneck 
is not the existing formats so much as information about those 
formats. This is the gap this paper aims to fill. 

This work concentrates on graph exchange formats. Such 
formats have certain requirements above and beyond simple 
storage: most obviously portability. However, portability in 
this context is not purely about syntax. Exchange also requires
common definitions of the meaning of the attributes. 

On the other hand, file size is not a primary consideration. 
Hence many exchange formats pay little attention to this and 
related details (e.g., read/write performance). 

We concentrate on exchange formats, but some of the 
formats considered here were not originally developed with 
exchange in mind, but have become de facto exchange formats 
through use. In these cases we see reversals of objectives 
compared to some purpose-built exchange formats. We shall 
therefore consider a large range of such features for comparison,
noting as we do so that as exchange of very large datasets
becomes important, the requirements will change. 

Many of the formats presented may seem obsolete. Some 
are quite old (in computer science years). Some have clearly 
not survived beyond the needs of the authors’ own pet project. 
However, we have listed as many as we could properly document,
partially for historical reference, partially to show
the degree of reinvention in this area, and, more importantly,
because old and obscure isn’t bad. For instance NetML, a
format that doesn’t seem to be used at all by any current 
toolkits, incorporates some of the most advanced ideas of any 
format presented. A good deal could be learnt by current tool 
builders if they were to reread the old documentation on this 
format. 

It is important to note that this paper does not present yet
another format of our own. It is common in this and other
domains for the discussion of previous works to be coloured 
by the need to justify the authors’ own proposals. Here we 
aim to be unbiased by the need to motivate our own toolkit, 
and so (despite temptation) do not provide any such. 

We do not argue that new graph formats should never be 
developed. In some applications new features are needed that 
are not present in the existing formats. However, it is critical 
that those who wish to propose new ideas should understand 
whether they are really needed. Moreover, in studying the 
existing formats, and their features, we learn what should be 
required in any new format to make it more than a one-shot, 
aimed at only one application. In fact, the results suggest that 
new formats are desirable for several reasons, but that perhaps 
what would be more useful would be a container format 
capable of providing self-documentation and meta-data-like 
features, while encapsulating a set of formats with variable 
levels of feature support. 

So the value of this work is threefold: firstly it provides 
a relatively complete set of information about the currently 
available formats, secondly it provides a basis for selection of 
a suitable format, and thirdly it provides information about the
nature of the features that have been used, and that can be used
in future developments of graph exchange strategies.
II. Background

Graphs (alternatively called networks) have been used for 
many years to represent relationships between objects or 
people. 

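A mathematical graph $\mathcal{G}$ is a set of nodes (or vertices) $\mathcal{N}$
and edges (or links or arcs) $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$.

An alternative representation of a graph can be given
through its adjacency matrix $A$, defined by

$$A_{ij} = \begin{cases} 1, & \text{if } (i,j) \in \mathcal{E}, \\ 0, & \text{otherwise.} \end{cases}$$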

Other representations exist (and are discussed below in detail). 
These alternatives are often used to create computationally 
efficient operations on the graph. Underlying these alternatives 
is the choice of the first-class objects to be represented; 
mathematically, the graph is the first-class object, and nodes 
and edges are components of the graph, but it is useful, 
for instance, to represent the edges as a set of objects each 
with their own components (including their end-points), or to 
represent the nodes as the first-class objects, with edges as 
properties of the nodes. Each alternative has advantages in 
terms of particular algorithms that can be applied. 
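
As a minimal sketch of these alternatives, the same directed graph can be held with edges, the graph, or nodes as the first-class objects. Python is our choice for illustration only; the four-node graph and the function names are hypothetical and are not taken from any format discussed in this paper.

```python
# A small directed graph on nodes 0..3 (hypothetical example data).
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # edges as first-class objects


def to_adjacency_matrix(nodes, edges):
    """Graph as the first-class object: A[i][j] = 1 if (i, j) is an edge."""
    index = {v: k for k, v in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for i, j in edges:
        A[index[i]][index[j]] = 1
    return A


def to_neighbour_lists(nodes, edges):
    """Nodes as the first-class objects: each node carries its out-neighbours."""
    nbrs = {v: [] for v in nodes}
    for i, j in edges:
        nbrs[i].append(j)
    return nbrs


A = to_adjacency_matrix(nodes, edges)
nbrs = to_neighbour_lists(nodes, edges)
```

All three structures carry the same connectivity; they differ in which operations are cheap, for example testing for a given edge in the matrix versus enumerating a node's neighbours in the neighbour lists.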

Additional information is often added to a graph: for instance

• node or link labels (names, types, ...); 

• values (distances, capacity, size, ...); and 

• routing (paths taken when traversing the graph). 

This additional information is often critical to make use of the 
graph data in any real application. 

It has been necessary for many years for researchers in 
sociology, biology, chemistry, computer science, mathematics, 
statistics and other areas to be able to store graphs representing 
concepts as diverse as state-transition diagrams, computer- 
software structure, social networks, biochemical interactions, 
neural networks, Bayesian inference networks, genealogies, 
computer networks, and many more. Researchers also need 
to share data. They have done so by sharing files. As a result 
portable file formats for describing graphs have been around 
for decades. 

This document is concerned with providing information 
about these formats, specifically with the intention of moving 
towards a smaller number of standard formats (the current 
trend seems to be progressing in the other direction). 

We only look here at publicly disclosed formats, for the 
obvious reason that a format can’t really be called a data 
exchange format unless its definition is public. It is fair to say 
that although many were intended for exchange of information, 
most failed at this and were only really used for a single tool 
or database of graphs. In a few other cases, the format was not 
intended as an exchange format, but has become a de facto 
exchange format by virtue of the inclusion of I/O routines
in other software than its originator. In any case, we have 
tried to be inclusive here: we include anything that might be 
reasonably called an exchange format (and which is publicly 
documented to some degree), rather than trying to exclude 
those which we guess are not. 

There are many subtypes of graphs, and generalisations. For 
instance: the general description above is that of a directed
graph. An undirected graph has the property that if $(i,j) \in \mathcal{E}$
then so too is $(j,i)$.

It is important to note that it is often possible to represent 
one type of graph in terms of the other: for instance an 
undirected graph may be represented by a directed graph 
by including all reverse links in the data. However, this is 
inefficient. 
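
A minimal sketch of this embedding shows where the inefficiency comes from: every undirected edge is stored twice. The helper names `as_directed` and `is_symmetric` and the example data are our own assumptions.

```python
def as_directed(undirected_edges):
    """Represent an undirected graph as a directed one by adding reverse links."""
    directed = set()
    for i, j in undirected_edges:
        directed.add((i, j))
        directed.add((j, i))
    return sorted(directed)


def is_symmetric(directed_edges):
    """Check the undirected property: (i, j) present implies (j, i) present."""
    edge_set = set(directed_edges)
    return all((j, i) in edge_set for i, j in edge_set)


undirected = [(1, 2), (2, 3)]
directed = as_directed(undirected)
```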

Moreover there is the issue of intention. The intention of the 
person storing the data is important; for instance, an undirected 
graph that is stored as a directed graph may be edited to 
become directed. A native undirected format enforces the 
correct semantics. Thus when considering the type of graph 
being stored, we consider the native or explicitly supported 
subtypes, not those that can be implicitly supported. 

Other generalisations of graphs include multi-graphs, hypergraphs,
and meta-graphs (described in more detail below).
Subtypes include trees and DAGs (Directed Acyclic Graphs). 
Once again, it is often possible to represent these in terms of 
the simple directed graph, but often this will be inefficient, 
and deficient in terms of intention. We will therefore look for 
native support for these generalisations and subtypes. 

A. Related work 

We distinguish this work from the study of graph databases, 
which have a similar role in storing data where the relationships
have at least as much importance as the entities
they relate. However, although they may hold the same type
of data that we are considering here, the motivations for a
graph database are different. Typically, those concerned with
databases are interested in ACID (Atomicity, Consistency, Isolation,
Durability) and other similar properties. The underlying
assumption is that the data is changing dynamically according 
to some set of transactions and operations and that the database 
should work correctly under these conditions. Consequently 
graph databases are not simply concerned with the structure 
and description of the data, but also how that data may be 
operated on, and queried. On the other hand, the standard 
assumption in data exchange is that the data itself is relatively 
static, but portability is important. 

There is a wide-ranging survey of graph databases [1], 
which is more concerned with the underlying database aspects, 
e.g., the relationship between a graph database and other more 
traditional databases such as a relational database, and the 
properties of various exemplar graph databases. 

There is some overlap of concerns: in both cases there is 
some interest in data integrity, compression, and the like, but 
it is fair to say that these issues have typically taken second 
place in the design of graph exchange formats. 

There have been a number of other efforts to gather similar 
information on graph exchange formats by researchers [2]-[4]
and software distributors [5], [6]. The results provided
inspiration for some of the descriptors used here, but this paper 
aims to provide a more comprehensive summary. 

One additional paper to consider is [7], which was written 
specifically with the view of designing a new, more universal 
graph format. We deliberately avoid this approach in order to 
avoid bias in our discussion. 
III. The File Formats

As noted the aim here is to describe graph exchange formats, 
i.e., formats that are used to exchange data between scientists 
and programming environments. Not all of the formats started 
out that way - some were intended as internal formats for a 
particular software system, but have become de facto exchange 
formats when another system sought to leverage existing data 
by incorporating an existing format. A few of the formats are 
still primarily internal to a single system, but are important 
to describe because they exhibit an interesting feature. In the 
main we concentrate on those that were designed with data 
exchange in mind, or have been used in that way in practice. 

This list is incomplete. There are some formats that we 
have observed in the literature, but have been unable to find 
documented (e.g., Gem2Ddraw), or which appear to only be
used as an internal format for a single tool. The graph formats 
we know of that have been excluded are the Tom Sawyer 
format, gem3Ddraw, PROGRES, GTXL, GedML, UXF, GRL, 
VEGA, BLGF, GraphLab, BNIF, BIE, XGML, NMF, Inflow, 
GDS, Tnet and RDF. Additional information sources covering 
these would be welcomed. 

There are a few formats that we have lumped together 
under the general heading of TGF (the Trivial Graph Format) 
because they are all functionally equivalent to a delimited 
edge list. There is no point in listing every variant of this 
approach: there are many and they vary mainly on the choice 
of storage (plain ASCII through to Excel), and delimiter (tabs 
and commas are common). 
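
As an illustration of how little such a format requires, the following sketch reads a delimited edge list using Python's standard `csv` module. The function name `read_edge_list`, the layout of two identifier columns followed by optional extra fields, and the sample data are our own assumptions, not part of any particular variant's definition.

```python
import csv
import io


def read_edge_list(stream, delimiter="\t"):
    """Read one edge per line: source, target, then any extra fields."""
    edges = []
    for row in csv.reader(stream, delimiter=delimiter):
        if not row:  # skip blank lines
            continue
        source, target, *extra = row
        edges.append((source, target, extra))
    return edges


sample = "a\tb\t1.5\nb\tc\t2.0\n"
edges = read_edge_list(io.StringIO(sample))
```

Identifiers and extra fields are deliberately kept as strings here, because variants of this approach rarely state what type they are meant to have.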

There are many file formats that could, in principle, contain 
a graph: e.g., XML, JSON, SGML, Avro, YAML (YAML 
Ain’t Markup Language), RDF (the Resource Description 
Framework), HDF (the Hierarchical Data Format). For that 
matter any image file could contain an adjacency matrix.
We do not list such generic formats unless there is a specific
extension of them designed to provide support for graphs, in
which case we list the specific, not the generic. For instance,
several software tools say that they can read/write JSON or
other generic serialisations of data, but without details of
exactly what is being serialised these are not useful exchange
formats. We treat Matlab’s
.mat format as a special case because it has explicitly been
used to exchange graph data, at least between instances of 
Matlab, even though it is a generic data format. 

We also aim to avoid, for simple practicality, formats that 
represent data that has a graph structure, but whose main 
content is not the graph. For instance HTML: the graph 
structure of the WWW is vastly smaller than the content and 
HTML is intended to store both in a distributed fashion. If 
one wished to represent the graph of the WWW, then another 
format seems indicated. Other examples include SBML (the 
Systems Biology Markup Language), and FOAF [8] (Friend 
Of A Friend). 

Table I provides the list of exchange formats we do include, 
as well as links and references. Check marks in the table 
indicate that we have had at least cursory feedback about 
the information in the table from one of the creators or 
maintainers of the format (we received such feedback on 23 
of the formats). Please see the acknowledgements for a list of
contributors.

We have also tried to include a reference time frame to 
provide some historical context for the format. The dates are 
based on explicit records from the first recorded reference to 
the format, through to the last recorded date of maintenance. 
However, this information is often not supplied, so we have
used the closest available proxy: for instance, change-logs,
copyright dates on format documentation, or publication dates
for papers. Hence these should not be seen as completely
reliable data. It is an attempt to document the historical
development of this field, so much of which is not in the
archival journals*.

For instance, Figure 1 provides a quick summary. We can
see that there was a flurry of activity in the late 90s continuing
on until today, but the style of contributions has changed over
time. It is interesting to see how XML became flavour of the
day in the late 90s, and then dropped out of popularity
in recent years, and in the most recent past there seem to be
several efforts to design graph formats on top of JSON. It
seems there are fads even within technical fields.



Fig. 1: New format origination dates (horizontal axis: year).


*Please note that some formats are notionally obsolete according to
reference dates, but may still be used by archival stores of graph data.
No. | Graph Format | Refs. | Audited | Full Name | Reference time frame
--- | --- | --- | --- | --- | ---
1 | bintsv4 | [9] | | bintsv4 (GraphLab) | 2009-present
2 | BioGRID TAB | [10], [11] | ✓ | BioGRID TAB 2.0 Format | 2003-present
3 | BLAG, GDToolkit | [12] | | Batch layout generator (GDToolkit) | 1998-2008
4 | BVGraph | [13] | ✓ | Boldi-Vigna graph compression | 2004-2011
5 | Chaco | [14] | ✓ | Chaco graph format | 1994-1995
6 | Cluto | [15] | | Cluto/Metis/Graclus format | 1999-2008
7 | DGS | [16] | ✓ | Dynamic GraphStream Format | 2010-2013
8 | DGML | [17] | | Directed Graph Markup Language | 2009-2013
9 | DIMACS | [18] | | DIMACS graph format | 2006-2006
10 | Dot | [19] | | Graphviz Dot Language | 2000-present
11 | DotML | [20] | | Dot Markup Language | 2002-2010
12 | DyNetML | [21] | | DyNetML XML | 2001-2009
13 | GAMFF | [22] | | A Graph and Matrix Format | 1995-1995
14 | GDF | [23] | | Guess Data Format | 2007-2010
15 | GDL | [24] | | Graph Description Language | 1993-1995
16 | GEDCOM | [25] | | Genealogical data | 1987-1996
17 | GEXF | [26] | ✓ | Graph Exchange XML Format | 2007-2012
18 | GML | [27] | ✓ | Graph Modelling Language | 1995-1999
19 | Graph6 | [28] | ✓ | Graph6 | 1996-2011
20 | Graph::Easy | [29] | ✓ | Perl Graph::Easy format | 2004-present
21 | GraphEd | [30], [31] | | GraphEd simple format | 1994-1994
22 | GraphJSON | [32] | | Graph JSON | 2013-2014
23 | GraphML | [33] | ✓ | Graph Markup Language | 2000-present
24 | GraphSON | [32] | | TinkerPop’s JSON-based Graph format | 2011-2013
25 | GraphXML | [34] | ✓ | XML-Based Graph Description Language | 1998-1998
26 | GraX | [35] | | GraX | 1999-1999
27 | GRXL | [36] | | XML Specification for Grrr Program | 2000-2000
28 | GT-ITM | [37] | | Georgia Tech Internetwork Topology Models | 1996-1998
29 | GXL | [38] | ✓ | Graph eXchange Language | 1999-2006
30 | Harwell-Boeing | [39] | | Harwell-Boeing sparse (adjacency) matrix | 1992-2010
31 | Inet | [40] | | Inet Topology Generator file | 2000-2002
32 | ITDK | [41] | ✓ | CAIDA Internet Topology Data Kit | 2002-present
33 | JSON Graph | [42] | | json-graph-specification | 2014-present
34 | LEDA | [43] | | LEDA format | 2001-2008
35 | LGF | [44], [45] | | LEMON Graph Format | 2008-present
36 | LGL | [46], [47] | | Large Graph Layout | 2003-2005
37 | LibSea | [48] | ✓ | CAIDA LibSea format | 2000-2005
38 | KrackPlot | [49] | ✓ | KrackPlot data format | 1993-present
39 | Matlab | [50] | | Matlab saved workspace | 1996-present
40 | Matrix | [51] | | Matrix Market sparse matrix | 1996-2013
41 | Mivia | [52], [53] | | Mivia ARG database format | 2001-2003
42 | MultiNet | [54], [55] | | MultiNet | 1999-2007
43 | Netdraw VNA | [56], [57] | | Netdraw VNA | 2005-2008
44 | NetML | [58] | | Network Markup Language | 1995-1995
45 | Ncol | [46], [47] | | Large Graph Layout | 2003-2005
46 | NNF | [59], [60] | | Nested Network Format | 2003-present
47 | Nod | [49] | | KrackPlot Node format | 1993-present
48 | NOS | [61] | | Neo Org Stat format | 2000-2013
49 | ns-tcl | [62], [63] | | ns-2 Tcl network definition | 1989-2011
50 | OGDL | [64] | ✓ | Ordered Graph Data Language | 2002-present
51 | OGML | [65], [66] | | Open Graph Markup Language | 2012-present
52 | Osprey | [67] | | Osprey file format | 2001-2008
53 | Otter | [68] | | Otter’s native format | 1999-1999
54 | Pajek (.net) | [69]-[71] | ✓ | Pajek Tool’s .net format | 1996-present
55 | Pajek (.paj) | [69]-[71] | ✓ | Pajek Tool project (.clu, .vec, .per, ...) | 1996-present
56 | Planar | [28] | ✓ | Plantri Planar Code and Edge Code | 1996-2011
57 | PSI MI | [72] | | Proteomics Standards Initiative Molecular Interaction | 2002-present
58 | RSF | [73] | | Rigi Standard Format | 1999-2010
59 | Rocketfuel | [74] | | Rocketfuel ISP Maps | 2002-2003
60 | Rutherford-Boeing | [75] | | Rutherford-Boeing sparse (adjacency) matrix | 1997-1997
61 | SGB | [76], [77] | | Stanford GraphBase | 1992-2009
62 | SGF | [78], [79] | | Structured Graph Format | 1998-1999
63 | S-Dot | [80] | | S-Dot (lisp interface to Graphviz) | 2006-2010
64 | SIF | [59], [60] | | Simple Interaction Format | 2003-present
65 | SNAP | [81] | ✓ | Stanford Network Analysis Platform | 2005-present
66 | SoNIA | [82], [83] | ✓ | SoNIA Son format | 2002-present
67 | Sparse6 | [28] | ✓ | Sparse6 | 1996-2011
68 | StOCNET | [84] | | StOCNET native format | 2002-2007
69 | TEI | [85] | | Text Encoding Initiative Graph Format (XML-compatible) | 2001-present
70 | TGF | [86] | ✓ | Trivial Graph Format, and other simple edge lists (CSV, TSV, Excel, ...) | NA
71 | Tulip TLP | [87], [88] | | Tulip graph format | 2002-2012
72 | UCINET DL | [89], [90] | | UCINET Data Language | 2002-2013
73 | XGMML | [91] | | eXtensible Graph Markup and Modeling Language | 2000-2001
74 | XMLBIF | [92] | | XML-based BayesNets Interchange Format | 1998-2013
75 | XTND | [93] | | XML Transition Network Definition | 2000-2000
76 | YGF | [94] | ✓ | Y Graph Format | 2004-present

TABLE I: The format list. Checkmarks indicate formats that have had their details audited by someone associated with creation
or maintenance of the format.
IV. Descriptors and Discriminators

In order to describe the formats we will consider here,
we need some simple means to compare and contrast. Of
necessity, these will oversimplify some of the issues. For
instance, where a format uses multiple files we have not
attempted to explain exactly how data is divided between these
files.

What’s more, many descriptions of file formats are imprecise.
It is common to describe the format by reference to
examples. Although useful for simple cases, these leave out
important details: for instance, the character set supported,
and even more surprisingly, the format of identifiers. It is
often vaguely suggested that these are numbers, but without
formal definition of what is allowed (presumably non-negative
integers, but are numbers outside the 32 bit range supported?).

In the following, we make the best estimate of the capabilities
of each format through reference to online documentation,
and through a survey of the file format creators. In many
cases the results are inferences, so in this section we will
outline the features we describe, and the assumptions made
in compiling our data. However, we have made the best effort
possible to contact authors of formats, and their comments
about capabilities have been given precedence.

There are four main types of descriptors here:

file type : these are simple issues of the type of file storing
the data: binary vs ASCII, etc.

graph types : this refers to the nature of the graph data that
can be stored.

attributes : these are features related to supplemental data
about nodes and edges, such as labels and values associated
with these.

general : this is a grab bag for additional features that don’t
fit in any of the previous classes.

We’ll describe each of these in detail below, and then provide
a table of the features vs file formats.

One last point, this is not intended as a pejorative list. We 
do not mean to imply that having a feature is good or bad. 
The aim is to provide potential users with the background to 
choose the right format for their purposes. 

A. File Type 

encoding : This is, in principle, a simple distinction in file
type between text and binary files. However, text files
today can use multiple different character sets, and this
is important because some graphs will be labelled with
non-English character sets. However, the majority of file-format
definitions leave unspecified the character set to
be used. We assume here that the character set is ASCII,
unless there is some indication otherwise: either an explicit
statement, or, in the case of applications of XML, the
character set supported is assumed to be Unicode.
Figure 2 indicates the proportions of files providing each
type of encoding.
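
The practical consequence of an unspecified character set can be seen in a short sketch. The helper name `encodable` and the node label are hypothetical examples of our own: a label containing non-English characters cannot be written under an ASCII assumption, but can under UTF-8 or ISO 8859-1.

```python
def encodable(label, charset):
    """Return True if `label` can be stored using `charset`."""
    try:
        label.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


label = "Zürich"  # hypothetical node label
ascii_ok = encodable(label, "ascii")
utf8_ok = encodable(label, "utf-8")
latin1_ok = encodable(label, "iso-8859-1")
```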

representation : Methods to represent a graph include: 
matrix : The graph’s full adjacency matrix, 
edge : A list of the graph’s edges [95]. 



Fig. 2: Support for different encodings (storage types shown: ASCII, ASCII/binary, binary, ISO 8859, Unicode, UTF-8).


smatrix : The matrix representation is poor for sparse 
graphs, which are common in real situations. However, 
some tools actually store a sparse matrix, which is 
almost equivalent to an edge list^. There is a subtle 
difference in that a matrix view of the edges in a 
network cannot contain much detail about the edges 
(only one number), and so we have a separate name, 
smatrix, for formats that use this type of representation. 

neighbour lists : This is a list of the graph’s nodes, each
with a list of its neighbours. These are often called
adjacency lists, but we avoid that term because it is easily
confused with the edge list.

path : One can also implicitly represent a graph as a 
series of path descriptions (essentially a path is a list of 
consecutive edges). This could be useful, for instance, 
with a tree or ring. 

Moreover, graph data is often derived from path data, 
i.e., a series of paths are analysed, and the edges on 
these become the graph. In other cases, one might like 
to store path information, for instance related to routes 
along with the graph. 

constructive : Graphs can often be described in terms of 
mathematical operations used to construct the graphs: 
for instance graph products on smaller graphs [96]. See 
[58] for a description of “levels” of graph formats. 
Apart from simple incremental construction, the only 
format that seems to allow this is NetML [58]. 

procedural : Many graphs can be concisely defined by
a set of procedures, rather than explicit definition of
the nodes and links. This type of graph format could
be very concise, but verges on creating another programming
language. In fact, many graph libraries for
particular programming languages essentially provide
this, but in a non-portable manner.

The only generic (language independent) format that 
seems to allow this is NetML [58]. 

Any procedural approach admits the possibility of
defining a method for constructive graph description,
but we do not automatically count any procedural
approach as constructive, unless it provides explicit
graph-related operations as part of the toolkit.

^There is one exception to this: Cluto stores sparse matrices in a format
more closely resembling the neighbour representation.

Fig. 3: Simple directed graph with representations: (a) adjacency matrix, (b) edge list, (c) neighbour lists, (d) paths.

Fig. 4: Proportions supporting different representations (categories shown: edge, edge/const/proc, edge/matrix, edge/neigh, edge/neigh/matrix, edge/path, edge/paths, edge/procedural, matrix, matrix/smatrix, neigh, neigh/edge/matrix, smatrix, smatrix/matrix).

These representations are given varying names in the
literature, but we use the names above to be clear.
Figure 3 illustrates four of these, and Figure 4 shows
the proportions of formats supporting each.
The representation is important; for a graph with $N$ vertices
and $E$ edges, the adjacency matrix requires $O(N^2)$
terms, the edge list $O(E)$ terms, and the neighbour list
$O(N + E)$ terms. However, the terms in a matrix are
$\{0,1\}$ whereas the terms in the edge and neighbour
lists are node identifiers (consider they might be 64 bit
integers), so the size of a resulting file based on each
representation depends on many issues, including the way
the data is stored in the file. No approach is universally
superior.
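
A rough calculation illustrates the trade-off. The sketch below assumes, purely for illustration, one bit per matrix entry and 64-bit node identifiers, and it ignores delimiters, headers and compression; the function name `storage_bits` and the two example graphs are hypothetical.

```python
def storage_bits(n_nodes, n_edges, id_bits=64):
    """Idealised storage cost in bits for three representations.

    Assumes 1 bit per adjacency-matrix entry, and `id_bits` per node
    identifier in the edge and neighbour lists (no delimiters or headers).
    """
    return {
        "matrix": n_nodes * n_nodes,                 # N^2 entries of 1 bit
        "edge": 2 * n_edges * id_bits,               # two identifiers per edge
        "neighbour": (n_nodes + n_edges) * id_bits,  # one id per node, one per edge
    }


sparse = storage_bits(n_nodes=1_000_000, n_edges=5_000_000)
dense = storage_bits(n_nodes=1_000, n_edges=500_000)
```

Under these assumptions the matrix is the largest option for the sparse graph and the smallest for the dense one, which is consistent with the observation that no approach is universally superior.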

Moreover, some may be easier to read and write: for instance
a neighbour listing may be slightly more compact
than an edge list, but the latter has the same number of
elements per line, potentially making it easier to perform
I/O in some languages.

More subtly, a neighbour-list representation treats edges 
as properties of nodes, whereas an edge list treats edges 
as objects in their own right; and the matrix representation 
treats the graph as the only object with nodes and edges 
as properties of the graph. Although a program can 
internally represent data however it likes, and read in a 
neighbour list into structures that treat edges as objects in 
their own right, the native treatment of data is reflected in 
the ease with which attributes can be added. For instance, 
in a neighbour list it is intrinsically harder to record 
attributes for edges, and in the matrix representation it is 
harder to record attributes for nodes. This is, fundamentally,
why we regard edge-list and sparse-matrix formats
as different. 

Some graph file formats allow alternative representations, 
and so we list all that are possible. However note that this 
is often actually multiple file formats under one name. It 
seems rare to allow a mixed representation. 

We haven’t (yet?) reported on whether edge-list formats
explicitly list nodes or only implicitly list them as a
consequence of edges. The latter is briefer, but requires
a special case for degree 0 nodes.
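
The special case can be seen in a small sketch (the data and the function name `nodes_implied_by` are hypothetical): a node of degree 0 never appears in any edge, so it is lost unless nodes are listed explicitly.

```python
def nodes_implied_by(edges):
    """Recover the node set from an edge list alone."""
    implied = set()
    for i, j in edges:
        implied.update((i, j))
    return implied


nodes = {1, 2, 3, 4}  # node 4 has degree 0
edges = [(1, 2), (2, 3)]
missing = nodes - nodes_implied_by(edges)
```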

When considering generalisations of graphs, other representations
are possible (for instance tensors can generalise
the concept of an adjacency matrix for multi-layer
networks). However, codification of these is an ongoing 
research topic [97] and so we will not try to encapsulate 
it here. 

structure: This field describes how the file format’s structure 
is defined. The cases are: 

simple : the typical approach to create a graph format
is to use one line per data item (a node, an edge,
or a neighbourhood), with the components of a line
separated by a standard delimiter (a comma, tab, or
whitespace). There are many variations on this theme,
some more complex than others, for instance including
labels, comments or other information. These formats
are usually specified by a very brief description and
one or two examples. They rarely specify details such
as integer range or character set.

intermediate : this is a slight advance on a simple file
format, in that it includes some grammatical elements.
For instance, the file may allow definition of new types
of labels for objects. However, in common with simple
files, these are usually only specified by a very brief
description and one or two examples, not a complete
grammar.

BNF : means that the file format is described using a
grammar, loosely equivalent to a Backus-Naur Form
(BNF). This is perhaps the most concise, precise description.
When done properly it precisely spells out
the details of the file in a relatively short form.

XML, JSON, SGML, ... : many graph file formats extend
XML, JSON, SGML, or similar generic, extensible
file formats. This is a natural approach to the
problem, and allows a specification as precise as BNF,
though only through reference to the format being
extended. Thus it is precise, but sometimes rather
difficult to ascertain all of the details, unless one is
an expert in XML, etc.

On the other hand, these approaches draw on the wealth
of tools and knowledge about these data formats. On
the other hand again, to use those tools the model of
your graph object has to map to the XML model (or
at least be easily transformed into that form).

Tcl, Lisp, ... : As noted above one approach to defining
a graph is procedural. Most of the approaches that
allow this are extensions or libraries for common
programming languages.

We will not list every programming language and
library as a data format though because, generically,
such approaches are not portable between programming
languages. We do mention a few formats though
(ns-2 and S-Dot), because translators exist between
these and other data formats.

Figure 5 shows the proportion of each type of structure
within the files.

Fig. 5: Proportions with each structure (categories shown: BNF, Intermediate, JSON, Other, Simple, XML).


single or multiple files : Most file formats use a single file,
but some formats require multiple files, for instance,
separate files for the lists of nodes and edges. Other formats
allow supplementary information in additional files, so multiple
files aren’t mandatory. We have only classified the formats
by whether multiple files are allowed, not whether they
are mandatory (because the latter requires a distinction
about what mandatory would mean: does it mean they are
required to support basic features or advanced features?).

integral meta-data : Meta-data is data about the graph; for
example, its name, its author, the date created, and so on.
This is very important data, but many formats provide no
means to include it in the file, and instead rely on external
records. We refer to meta-data as integral if it is contained
in the file itself.

Some formats allow meta-data through unstructured
comments. This is better than nothing, but the lack of structure
in the comments means these are not machine readable,
in general.

Some file formats provide only a limited range of
meta-data fields, whereas others are arbitrarily extensible. To
distinguish the various cases we fill this field with one of
the following:

no : No meta-data is allowed.

comments : Unstructured meta-data is allowed in
comments.

fixed : A defined set of meta-data can be included, e.g.,
a date or name field is predefined as part of the format.

arbitrary : An explicit mechanism is described to allow
the user to specify arbitrary meta-data to be included.

The value of meta-data is clear, but once again, let us
reiterate that there are plusses and minuses in different
approaches. For instance, arbitrary meta-data may seem
superior, but can then lead to ambiguity about what
meta-data should be kept for each dataset, whereas having
a fixed list of attributes can make it obvious what is
expected. However, it is common for formats to have
support downwards, e.g., formats with fixed attributes
often also support comments, and those with arbitrary
properties often support some set of fixed properties and
comments.

Figure 6 shows support for various types of meta-data in 
the formats. 

Fig. 6: Support for meta-data.

built-in compression : It is easy enough to compress a graph
file using common utilities such as gzip, and the typical
compression ratio will be reasonably good as graph files
often have many repeated strings. However, one format
(BVGraph) provides for compression of the graph as it
is written, in much the way image file formats allow
intrinsic compression of the image.

Graph compression algorithms have been a topic of
study at least since 2001 [13], [98], [99], with numerous
follow-ups. So it is interesting that only one format is
designed around this feature. However, two other formats
provide some crude mechanisms to reduce the size of
the file. Finally, DGS formally acknowledges the role of
compression by requiring that a gzipped file be accepted
by software reading its format.
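
As an illustration of what accepting a gzipped file can mean for a reader, the sketch below checks for the two-byte gzip signature and decompresses transparently before parsing. It is a minimal sketch and not DGS’s actual reader; the function name and the synthetic edge list are assumptions made for the example.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_graph_text(raw: bytes) -> io.StringIO:
    """Return a text stream for raw file contents, gzipped or not."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return io.StringIO(raw.decode("utf-8"))


# A synthetic edge list with many repeated strings.
plain = "\n".join(f"node{i} node{i + 1}" for i in range(1000)).encode("utf-8")
packed = gzip.compress(plain)
print(len(plain), len(packed))  # sizes before and after compression

# The reader yields the same text either way.
assert open_graph_text(plain).read() == open_graph_text(packed).read()
```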

Table II provides the information on file types. 


B. Graph Types 

directed/undirected : The two basic forms of graph are the
directed and undirected graph. In the former, edges (or
arcs) imply a relation from one node to another. In the
latter, an edge implies a relationship in both directions.
Some graph formats specify one or the other; others allow
the user to specify either, and the most general allow the
user to specify the type of each edge^. In one case, the
format is explicitly restricted to DAGs (Directed Acyclic
Graphs).

Many graph formats fail to specify their type. In that
case we assume it is directed if the edges/arcs are
specified by directional nomenclature (e.g., from/to or
source/destination). We also assume that matrix formats
are directed unless there is specific mention of a mechanism
to represent the upper triangular part of the matrix alone.
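
The reasoning behind the matrix assumption can be sketched in a few lines: a full adjacency matrix can encode direction, while storing only the upper triangle forces symmetry and hence an undirected graph. The row layout below (row i holds entries (i, i) to (i, n-1)) is an assumption for illustration, not the layout of any particular format.

```python
def expand_upper_triangle(rows):
    """Build a full symmetric adjacency matrix from its upper triangle.

    rows[i] holds entries (i, i), (i, i+1), ..., (i, n-1).
    """
    n = len(rows)
    full = [[0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != n - i:
            raise ValueError(f"row {i} should have {n - i} entries")
        for offset, value in enumerate(row):
            j = i + offset
            full[i][j] = value
            full[j][i] = value  # mirror: the edge has no direction
    return full


def is_symmetric(matrix):
    """True if the matrix could describe an undirected graph."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i))


upper = [[0, 1, 1], [0, 0], [0]]  # undirected edges 0-1 and 0-2
full = expand_upper_triangle(upper)
print(full, is_symmetric(full))
```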



Fig. 7: Graph type support (categories shown: DAG, directed,
either, either or tree, either/bipartite, mixed, planar,
undirected, unspecified).


^Of course a directed graph format can contain an undirected graph by 
including edges in both directions, but we are considering here whether it can 
do this a little more succinctly. 


multi-graph : A multi-graph is a graph generalisation that
allows (i) self-loops, and (ii) more than one edge between
a single pair of nodes.

Some formats specifically allow, or disallow,
multi-graphs. A few allow loops, but not multi-edges. Many,
however, say nothing on the topic. We assume in this case
that formats presenting either matrix or neighbour-list
representations don’t allow multi-graphs. It is technically
possible to represent a multi-graph in these cases, but this
would require special processing of the information, and
unless we see an indication that this is present we assume it
is not. Edge lists, however, can easily cope with
multi-graphs. We suspect it is left to the software supporting
the data format to make a decision about how to deal
with these cases, and the decision may be inconsistent
between supporting software. Hence it seems important
that when an edge-based representation leaves the
question unspecified, we note this status.

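
Because an edge list can silently carry self-loops and repeated edges, software reading a format that leaves the question unspecified has to decide what to do with them. The sketch below only detects the two cases; what to do next (reject, merge, or keep) is exactly the decision the text notes may differ between tools. The function name and the directed flag are illustrative assumptions.

```python
from collections import Counter


def multigraph_features(edges, directed=True):
    """Report self-loops and repeated edges in an edge list."""
    loops = [(u, v) for u, v in edges if u == v]
    # For undirected data, (u, v) and (v, u) are the same edge.
    keys = [(u, v) if directed else tuple(sorted((u, v))) for u, v in edges]
    repeated = {e: c for e, c in Counter(keys).items() if c > 1}
    return loops, repeated


edges = [("a", "b"), ("b", "a"), ("a", "b"), ("c", "c")]
print(multigraph_features(edges, directed=True))
print(multigraph_features(edges, directed=False))
```

Note that the same file yields a different multi-edge count depending on whether it is read as directed or undirected, which is one way the inconsistency arises.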
hyper-graphs : A hyper-graph allows edges that connect
more than two nodes. These are useful for some
problems: for instance indicating a multi-access medium in a
computer network (such as a wireless network).

One can realise hyper-graphs using existing graph
representations by adding a new type of node (representing the
hyper-edge) and creating simple edges from this to all the
hyper-edge adjacencies (which can then be represented by
a node-neighbour or adjacency list for a bipartite graph);
or by creating “groups”, whose membership represents
the hyper-edge. Hence existing formats can often support
hyper-edges in principle. However, true support needs
specialised data for hyper-edges in the software reading
or writing the data, so unless a format explicitly states it
can support these and presents the mechanism, we assume
it cannot.

As a point to note, if hyper-graph support is intended
to be included in a data format, then the list of graph
representations is expanded to include the means of
describing a hyper-edge:

direct : The groups/hyper-edges are directly defined by
listing the set of nodes included in each;

indirect : Node definitions include a group-membership
attribute that defines which nodes are connected by the
defined group;

hmatrix : A {0,1} matrix of size N × E (where there
are N nodes and E hyper-edges) maps nodes to
hyper-edges. A sparse shmatrix version of this could be
stored.

These representations are illustrated in Figure 8. As
before none is universally superior, though the direct
method seems likely to win for most realistic graphs.

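
The three hyper-edge representations, and the bipartite realisation described earlier, carry the same information and can be converted into one another. The sketch below starts from the direct form (a mapping from hyper-edge name to member nodes) and derives the others. The data structures and function names are assumptions chosen for the example; no surveyed format is being reproduced.

```python
def direct_to_indirect(hyper_edges):
    """direct -> indirect: map each node to the groups it belongs to."""
    membership = {}
    for name, members in hyper_edges.items():
        for node in members:
            membership.setdefault(node, set()).add(name)
    return membership


def direct_to_hmatrix(hyper_edges, nodes):
    """direct -> hmatrix: an N x E {0,1} incidence matrix."""
    names = sorted(hyper_edges)
    return [[1 if node in hyper_edges[name] else 0 for name in names]
            for node in nodes]


def direct_to_bipartite(hyper_edges):
    """Realise each hyper-edge as an extra node joined to its members."""
    return sorted((name, node)
                  for name, members in hyper_edges.items()
                  for node in members)


hyper_edges = {"h1": {"a", "b", "c"}, "h2": {"c", "d"}}
nodes = ["a", "b", "c", "d"]
print(direct_to_indirect(hyper_edges))
print(direct_to_hmatrix(hyper_edges, nodes))
print(direct_to_bipartite(hyper_edges))
```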
hierarchy : It is common for graphs to have sub-structure,
for instance nodes that themselves contain graphs.
Several formats provide mechanisms to record this
sub-structure. Unfortunately, there does not seem to be a
consistently used definition of this type of structure [100],
and so we see differences not just in the representation,
but also what exactly is being represented. The problem
becomes even more complicated when hierarchy and
hyper-graphs are combined [97] (there is at least one
proposed solution [100] but it does not seem to be widely
used yet).

Here, we simply note whether the format provides a
version of hierarchy.

Fig. 8: A hyper-graph with representations.

Graph Format       encoding                          representation     structure     integral metadata  built-in compression
-----------------  --------------------------------  -----------------  ------------  -----------------  --------------------
bintsv4            binary                            edge               simple
BioGRID TAB        ASCII                             edge               simple        comments
BLAG, GDToolkit    ASCII                             neigh              BNF           fixed
BVGraph            binary                            neigh              simple                           ✓
Chaco              ASCII                             neigh              simple        comments
Cluto              ASCII                             matrix/smatrix     simple                           limited
DGS                UTF-8                             edge               BNF           arbitrary
DGML               Unicode                           edge               XML           fixed
DIMACS             ASCII                             edge/path          simple        fixed
Dot                UTF-8                             edge               BNF           arbitrary
DotML              Unicode                           edge               XML           arbitrary
DyNetML            Unicode                           edge               XML           fixed
GAMFF              ASCII                             smatrix/matrix     BNF           fixed
GDF                ASCII                             edge               intermediate
GDL                ASCII                             edge               BNF           fixed
GEDCOM             UTF-8                             neigh              BNF           fixed
GEXF               UTF-8                             edge               XML           arbitrary
GML                ISO 8859                          edge               BNF           arbitrary
Graph6             coded ASCII                       matrix             simple                           limited
Graph::Easy        UTF-8                             edge/neigh         intermediate  fixed
GraphEd            ASCII                             neigh              BNF
GraphJSON          UTF-8                             edge               JSON
GraphML            Unicode                           edge               XML           arbitrary
GraphSON           UTF-8                             edge               JSON          arbitrary
GraphXML           Unicode                           edge               XML           arbitrary
GraX               Unicode                           edge/neigh         XML
GRXL               Unicode                           edge               XML           arbitrary
GT-ITM             ASCII                             edge               simple
GXL                Unicode                           edge               XML           arbitrary
Harwell-Boeing     ASCII                             smatrix            simple
Inet               ASCII                             edge               simple
ITDK               ASCII                             edge               simple        comments
JSON Graph         ASCII                             edge               JSON          arbitrary
LEDA               ASCII                             edge               simple
LGF                ASCII                             edge               intermediate  arbitrary
LGL                ASCII                             neigh              simple
LibSea             ASCII                             edge/path          BNF           arbitrary
KrackPlot          ASCII                             matrix             simple
Matlab             binary                            matrix/smatrix     HDF5          arbitrary          ✓
Matrix             ASCII                             smatrix            simple        comments
Mivia              ASCII/binary                      edge               simple        comments
MultiNet           ASCII                             edge/matrix        intermediate
Netdraw VNA        ASCII                             edge               simple
NetML              Unicode                           edge/const/proc    SGML          fixed
Ncol               ASCII                             edge               simple
NNF                ASCII                             edge               simple
Nod                ASCII                             neigh              simple
NOS                ASCII                             matrix             simple
ns-tcl             ASCII                             edge/procedural    Tcl
OGDL               ASCII (+ 8-bit var.s) and binary  edge/paths         BNF           comments
OGML               Unicode                           edge               XML           fixed
Osprey             ASCII                             edge               simple
Otter              ASCII                             edge               intermediate  fixed
Pajek (.net)       UTF-8                             edge/neigh/matrix  intermediate  comments
Pajek (.paj)       UTF-8                             edge/neigh/matrix  intermediate  comments
Planar             binary                            neigh              simple
PSI MI             Unicode                           edge               XML           arbitrary
RSF                ASCII                             edge               BNF           comments
Rocketfuel         ASCII                             edge/path          intermediate
Rutherford-Boeing  ASCII                             smatrix            intermediate  fixed              limited
SGB                ASCII                             edge/neigh         intermediate  fixed
SGF                Unicode                           edge               XML           arbitrary
S-Dot              ASCII                             edge/procedural    lisp          arbitrary
SIF                ASCII (URL encode)                edge/neigh         simple
SNAP               ASCII                             edge               simple        comments
SoNIA              ASCII                             edge               intermediate  comments
Sparse6            coded ASCII                       neigh              simple                           limited
StOCNET            ASCII                             matrix             simple
TEI                Unicode                           edge               XML           fixed
TGF, TGF           ASCII                             edge               simple        comments
Tulip TLP          ASCII                             edge               BNF           fixed
UCINET DL          ASCII                             neigh/edge/matrix  intermediate
XGMML              Unicode                           edge               XML           arbitrary
XMLBIF             Unicode                           neigh              XML           arbitrary
XTND               Unicode                           edge               XML           comments
YGF                binary                            edge               simple                           ✓

TABLE II: File types (see § IV-A for explanation of columns).

meta-graph : A meta-graph [101] is a generalisation of a
graph, multi-graph, hyper-graph, and hierarchical graph.
Once again, a meta-graph could in principle be
represented using existing data structures (in much the same
way that any data can in principle be represented in
XML), so this field refers to whether the format defines
the representation. As far as we know, no format yet
supports meta-graphs^, but this is included as a feature
as an indication of the type of feature that might require
a new format, or an extended version of an existing format.

edge-edge links : Generally, a graph has links between nodes,
but we could generalise the concept to allow meta-edges
that join edges as well (this is different from a
meta-graph).

This idea isn’t supported by many formats, and in the
case of GraphML it is specified using the extensibility of
GraphML, but again it is a useful example of the types
of features that may be needed in the future.

Table III shows which graph types are supported by each 
format. 

C. Attributes 

edge weights : A very common requirement is to store a
numerical value associated with an edge. Generically, we
call this a weight. Many formats provide the facility to
keep one such value.

multiple attributes : Some formats allow one to keep
multiple labels (numerical or otherwise) for each node and/or
edge.

^Note the term “meta-graph” is somewhat overloaded, e.g., there is at least
one package called metagraph that has nothing to do with the mathematical
meta-graph.


For some formats these are fixed (e.g., they allow a
name and a value), whereas others allow arbitrary lists
of attributes.



Fig. 9: Multiple attribute support. 


default values : Specifying the value of a weight or attribute 
for every edge or node can be laborious (if it has to be 
done by hand), and wasteful of space. Moreover, it makes 
it hard to see structure in the data. Simply providing 
a default value for the common case can improve the 
situation. We include here the case of simple inheritance 
of values through a tree of “class” structures on the 
objects. For instance, nodes can be given a type which 
conveys a default value to be overridden by a more 
specific type or particular value. Notice here we are not 
speaking of inheritance through the graph itself, but a 
structure on top of the graph. 
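
The lookup implied by simple inheritance of defaults can be sketched as follows: a value set on the node itself wins, otherwise the node’s type is consulted, then that type’s parent, and so on up the class tree. All names and the example attributes are hypothetical; the sketch only illustrates the resolution order described above.

```python
def resolve(attr, node, node_attrs, node_type, type_parent, type_defaults):
    """Look up attr on the node, then up its chain of types."""
    if attr in node_attrs.get(node, {}):
        return node_attrs[node][attr]        # a particular value wins
    t = node_type.get(node)
    while t is not None:
        if attr in type_defaults.get(t, {}):
            return type_defaults[t][attr]    # most specific type first
        t = type_parent.get(t)               # climb the class tree
    raise KeyError(attr)


# A two-level class tree defined on top of the graph, not through it.
type_parent = {"router": "device", "device": None}
type_defaults = {"device": {"colour": "grey", "ports": 1},
                 "router": {"ports": 8}}
node_type = {"r1": "router", "r2": "router"}
node_attrs = {"r2": {"ports": 48}}

for n in ("r1", "r2"):
    print(n, resolve("ports", n, node_attrs, node_type, type_parent, type_defaults))
```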

multiple inheritance : A few formats allow values to be
derived through inheritance of values from multiple classes
to which they belong. Thus they allow a node to have,
for instance, a type “router” which conveys that it is
an Internet router, with appropriate characteristics for
such a device, from vendor “Cisco” which conveys appropriate

Graph Format

directed; multi-graph; hyper-graph; hierarchy; meta-graph; edge-edge

bintsv4 

directed 

BioGRID TAB 

directed 

BLAG, GDToolkit 

either / 

BVGraph 

directed 

Chaco 

undirected 

Cluto 

directed 

DGS 

mixed / 

DGML 

unspecified unspecified 

DIMACS 

either / 

Dot 

mixed / / / 

DotML 

mixed / / / 

DyNetML 

directed unspecified 

GAMFF 

either unspecified / 

GDF 

unspecified unspecified 

GDL 

directed unspecified 

GEDCOM 

mixed unspecified 

GEXF 

mixed / / 

GML 

either / 

Graph6 

directed loops only 

Graph::Easy 

mixed / / / 

GraphEd 

unspecified / 

GraphJSON 

mixed unspecified 

GraphML 

mixed / / / / 

GraphSON 

directed unspecified 

GraphXML 

either unspecified / 

GraX 

directed unspecified 

GRXL 

directed unspecified 

GT-ITM 

undirected 

GXL 

mixed / / / 

Harwell-Boeing 

directed 

Inet 

undirected 

ITDK 

undirected 

JSON Graph 

either / 

LEDA 

either unspecified 

LGF 

either unspecified 

LGL 

undirected unspecified 

LibSea 

directed unspecified 

KrackPlot 

directed 

Matlab 

directed 

Matrix 

either/bipartite 

Mivia 

directed unspecified 

MultiNet 

unspecified unspecified 

Netdraw VNA 

directed unspecified 

NetML 

either unspecified / 

Ncol 

undirected 

NNF 

either / 

Nod 

directed 

NOS 

directed 

ns-tcl 

directed / / / 

OGDL 

directed 

OGML 

directed unspecified / 

Osprey 

undirected 

Otter 

mixed unspecified 

Pajek (.net) 

mixed loops only 

Pajek (.paj) 

mixed loops only / 

Planar 

planar 

PSI MI 

unspecified / 

RSF 

directed unspecified 

Rocketfuel 

undirected / 

Rutherford-Boeing 

either 

SGB 

unspecified unspecified 

SGF 

directed unspecified / 

S-Dot 

mixed / / / 

SIF 

mixed 

SNAP 

either / 

SoNIA 

directed / / 

Sparse6

undirected / 

StOCNET 

directed 

TEI 

either or tree unspecified 

TGF, TGF 

either 

Tulip TLP 

directed unspecified / 

UCINET DL 

directed 

XGMML 

either unspecified 

XMLBIF 

DAG 

XTND 

directed unspecified 

YGF 

mixed / 


TABLE III: Graph types (see § IV-B for explanation of columns). 






characteristics for that vendor. 

Once again, inheritance is not through the structure of 
the graph, but through a further structure defined on the 
graph objects. 

visualisation data : Files that allow arbitrary attributes can 
always provide data to be used in visualising the graph, 
but here we refer to formats that explicitly provide such 
data. 

The level of visualisation data varies dramatically: some 
formats only allow position information for nodes, 
whereas others allow SVG definitions to be used in 
drawing the nodes. Still others provide guidance about 
which layout algorithms to use in displaying the graph. 
There is not space here to document all of the variations 
possible, so we simply indicate whether any such data is 
defined or not. 

ports : These are a specialised piece of layout information:
ports^ are often specified by a compass direction,
and indicate where on a node the link should join
it. We include ports in addition to the previous field
because port-based information can also carry semantic
information about the relationship between links on a
complex node: e.g., the arrangement of links on a real
device like an Internet router.

temporal data/dynamics : A topic of interest is
analysis/visualisation of graphs as they change [95]. One way
to store this information is as a series of “snap-shot”
graphs, but storing it all together in the same file has
some appeal. A few formats provide some variant on this:
allowing links or nodes to be given a lifetime, or providing
“edits” to the graph at specific epochs.
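
The lifetime variant can be related back to snapshots with a short sketch: if each edge carries a start and end time, the snapshot at any instant is recovered by filtering. The tuple layout and the half-open interval convention [start, end) are assumptions for this example; the formats surveyed differ in how they express lifetimes.

```python
import math


def snapshot(timed_edges, t):
    """Edges alive at time t, for half-open lifetimes [start, end)."""
    return [(u, v) for u, v, start, end in timed_edges if start <= t < end]


timed_edges = [
    ("a", "b", 0, math.inf),  # present from time 0 and never removed
    ("b", "c", 2, 5),
    ("a", "c", 4, 6),
]

for t in (1, 4, 5):
    print(t, snapshot(timed_edges, t))
```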

Table IV explains the attribute features that are supported by
each format.

D. General 

extensible : Some formats allow extensibility in varying
forms. We only consider them to have this facility,
however, if they provide an explicit mechanism. For instance,
we do not regard all XML derivatives as intrinsically
extensible merely because they could, in principle, be extended
using standard XML techniques. The format has to
explain the explicit mechanism whereby it is extended.
Simply adding extra attributes is not considered
extensibility.

schema checking : A format that provides an explicit
mechanism to check that a file is in a valid format is useful.
We only say it has this facility if a tool exists to perform
the check (a schema-checking program, DTD, or other
similar formal tool).

checksums : It is possible for large data files to become
corrupted. A common preventative (or at least a check for
this problem) is to use a checksum. This is possible
for all files, but we say that a given format has this
capability if it includes it as an internal component (usually
checking everything except the checksum itself). Only a
few formats contain this check.
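
An internal checksum that covers everything except itself can be sketched with a trailer line. The trailer syntax and the choice of SHA-256 below are assumptions for illustration only; the formats that include checksums each define their own scheme.

```python
import hashlib

PREFIX = "# sha256: "  # hypothetical trailer syntax


def add_checksum(body: str) -> str:
    """Append a trailer holding the digest of everything before it."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body + PREFIX + digest + "\n"


def verify_checksum(text: str) -> bool:
    """Recompute the digest over the body and compare with the trailer."""
    body, sep, trailer = text.rpartition(PREFIX)
    if not sep:
        return False  # no checksum line present
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return trailer.strip() == digest


stored = add_checksum("a b\nb c\n")
print(verify_checksum(stored))                         # intact file
print(verify_checksum(stored.replace("b c", "b d")))   # corrupted edge
```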

^Ports are also called hooks in Pajek. 


external data references : Some formats allow reference to
external files. This could be for visualisation data,
meta-data, or other purposes. There are several approaches and
views on external references, but we record whether it is
expected that all relevant information will be in the file,
or whether there might be something external. Again, we
look for an explicit explanation of the mechanism, not
implicit inheritance from the parent file format.

multiple graphs : Some formats allow multiple graphs to be
held in one file. Again, we only count this as a feature if
the specification explicitly explains how.

incremental specification : A small number of formats that 
present multiple graphs allow these graphs to be specified 
incrementally. This is subtly different from including 
temporal dynamics, as there is no implication of time, 
and the different graphs could potentially be unrelated 
(for instance, this might be used to describe graph edit 
distance problems). 

In a sense incremental specification is a simple case of 
constructive graph definition, but it is a very limited case, 
with specific application, so we list it separately. 
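
Incremental specification amounts to storing one graph in full and each later graph as a list of edits against an earlier one. A minimal sketch, assuming a hypothetical edit syntax of ("+", u, v) for an added edge and ("-", u, v) for a removed edge:

```python
def apply_edits(base_edges, edits):
    """Return a new edge set: base plus '+' edits minus '-' edits."""
    edges = set(base_edges)  # copy, so the base graph is left unchanged
    for op, u, v in edits:
        if op == "+":
            edges.add((u, v))
        elif op == "-":
            edges.discard((u, v))
        else:
            raise ValueError(f"unknown edit {op!r}")
    return edges


g1 = {("a", "b"), ("b", "c")}
g2 = apply_edits(g1, [("+", "c", "d"), ("-", "a", "b")])
print(sorted(g1), sorted(g2))
```

Nothing in the edit list implies time: g1 and g2 are simply two graphs that happen to share most of their edges.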

Table V provides information on the other features of the file 
formats. 

V. Data statistics 

In this section we statistically summarise the necessarily
large tables presented earlier. Some of the charts already
presented provide some details, but we explore in more detail
by using the remaining columns to calculate the proportion of
formats supporting each of the features listed. This is plotted in
Figure 10. Note that in regard to features with multiple answers
(e.g., representation), we break the possibilities into categories
and list the proportion that support each category.

Most obviously, there is wide support for edge
representations along with an edge weight. Visualisation data is also
widely supported.

Next we look at bivariate correlation between columns in the 
tables. For each pair of columns, we calculate a contingency 
table and then a P-value for the Fisher exact test [102], which 
is used because we have lots of small strata. Figure 11 shows 
the significantly correlated pairs, where this is defined as 
having a significant P-value after Bonferroni adjustment [103]. 
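
For readers unfamiliar with the procedure, the sketch below computes a two-sided Fisher exact P-value for a single 2 × 2 contingency table and applies a Bonferroni adjustment to a list of P-values. It is not the computation used for Figure 11: it assumes both columns are binary (feature supported or not), whereas several columns in the tables have more than two categories, and it works directly from the hypergeometric distribution rather than calling a statistics package. Function names are illustrative.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum over all tables with the same margins that are no more
    # likely than the observed one (small tolerance for rounding).
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


p = fisher_exact_2x2(3, 0, 0, 3)
print(p, bonferroni([p, 0.5, 0.01]))
```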

Many of the results are obvious. For instance, it is hardly
surprising that there should be a significant correlation
between the file structure and schema checking.

On the other hand there are many surprising effects: 

• hyper-graphs and ports are often associated; and 

• multi-graph and default values are also associated. 

These seem to be indications that the type of file author who
thinks carefully about certain aspects of the file (e.g., the
types of graphs that will be represented) also thinks about
other aspects that require care. This would divide the formats
into “careful” and “quick and dirty”. More work is required to
establish if this connection is genuine or merely accidental.

Graph Format

edge weights; multiple attributes; default values; multiple inheritance; visualisation data; ports; temporal data/dynamics

bintsv4 


BioGRID TAB 

fixed 

BLAG, GDToolkit 

/ 

BVGraph 


Chaco 

/ / 

Cluto 

/ 

DGS 

/ arbitrary / / 

DGML 

/ arbitrary / / / 

DIMACS 

/ / 

Dot 

/ arbitrary / / / 

DotML 

/ arbitrary / / / 

DyNetML 

/ arbitrary / 

GAMFF 

/ fixed / 

GDF 

/ arbitrary / 

GDL 

/ fixed / / 

GEDCOM 

fixed 

GEXF 

/ arbitrary / / / / 

GML 

/ arbitrary / 

Graph6 


Graph::Easy

/ fixed / / / 

GraphEd 

/ 

GraphJSON 

/ arbitrary / / / 

GraphML 

/ arbitrary / visualisation / / 

GraphSON 

/ arbitrary 

GraphXML 

/ arbitrary / / 

GraX 

/ arbitrary 

GRXL 

/ arbitrary / / 

GT-ITM 

/ fixed / 

GXL 

/ arbitrary 

Harwell-Boeing 

/ 

Inet 

/ / 

ITDK 


JSON Graph 

/ arbitrary 

LEDA 

/ fixed 

LGF 

/ / 

LGL 

/ 

LibSea 

/ / / / 

KrackPlot 

/ 

Matlab 

/ almost 

Matrix 

/ 

Mivia 

/ 

MultiNet 

/ arbitrary 

Netdraw VNA 

/ arbitrary 

NetML 

/ / / / 

Ncol 

/ 

NNF 


Nod 


NOS 

/ 

ns-tcl 

/ fixed / / / 

OGDL 


OGML 

? fixed / / 

Osprey 

fixed 

Otter 

arbitrary / 

Pajek (.net) 

/ fixed visualisation / / / 

Pajek (.paj) 

/ arbitrary visualisation / / / 

Planar 


PSI MI 

/ arbitrary 

RSF 

/ arbitrary 

Rocketfuel 

/ / 

Rutherford-Boeing

/ fixed 

SGB 

/ fixed 

SGF 

/ arbitrary 

S-Dot 

/ arbitrary / / / 

SIF 


SNAP 


SoNIA 

/ arbitrary / / 

Sparse6


StOCNET 

/ fixed 

TEI 

/ fixed 

TGF, TGF 


Tulip TLP 

/ arbitrary / 

UCINET DL 

/ 

XGMML 

/ arbitrary 

XMLBIF 

/ fixed 

XTND 

fixed 

YGF 

/ arbitrary / / 


TABLE IV: Allowed attributes (see §IV-C for explanation of columns).

Graph Format

extensible; schema checking; checksums; external data references; multiple graphs; incremental specifications

bintsv4 


BioGRID TAB 

/ 

BLAG, GDToolkit 


BVGraph 


Chaco 


Cluto 


DGS 

CSS 

DGML 

/ 

DIMACS 

/ / 

Dot 

/ 

DotML 


DyNetML 

/ / / 

GAMFF 


GDF 

/ 

GDL 


GEDCOM 

/ / 

GEXF 

/ / / / 

GML 


Graph6 

/ / 

Graph::Easy

/ 

GraphEd 


GraphJSON 

/ 

GraphML 

/ / / / 

GraphSON 

/ / 

GraphXML 

/ / / / 

GraX 

/ 

GRXL 

/ / 

GT-ITM 


GXL 

/ / / / 

Harwell-Boeing 


Inet 


ITDK 


JSON Graph 

/ / 

LEDA 


LGF 


LGL 


LibSea 

/ ? 

KrackPlot 


Matlab 


Matrix 


Mivia 


MultiNet 


Netdraw VNA 


NetML 

/ / / 

Ncol 


NNF 


Nod 


NOS 

/ 

ns-tcl 

/ 

OGDL 

/ 

OGML 


Osprey 


Otter 


Pajek (.net) 

/ 

Pajek (.paj) 

/ 

Planar 


PSI MI 

/ / / 

RSF 


Rocketfuel 

/ 

Rutherford-Boeing 

/ 

SGB 

partial / 

SGF 

/ 

S-Dot 

/ 

SIF 


SNAP 


SoNIA 


Sparse6

/ / 

StOCNET 


TEI 

/ 

TGF, TGF 


Tulip TLP 


UCINET DL 


XGMML 

/ 

XMLBIF 

/ 

XTND 


YGF 

/ 


TABLE V: Other properties (see §IV-D for explanation of columns).

Fig. 10: Proportion of each feature in the data formats.
(Bar chart; horizontal axis: proportion with attribute, from 0.0 to 0.6.)


VI. Decisions 

The list above is not intended to be pejorative. However,
potential users need to make decisions about which format
to use. There are several issues that need to be considered in such
a decision, and although the first is the feature list required,
there are others:

data size : The size of the graph data to be recorded and
used is an important factor in file format decisions. This
is sometimes glossed over when XML-style formats are
considered: these are very redundant formats, and hence
much larger than needed, but they compress well. Hence,
the compressed version may be no larger than a tighter
initial specification. However, read/write
time (and indeed compression/decompression time) still
depends greatly on the format’s wordiness. Large graphs
need tighter formats: either binary formats, or at least
those that avoid unnecessary bloat.

At the far end of the spectrum is the possibility of
graph-specific compression being part of the storage
process (much as many image formats provide image
compression as an integral feature). Only one format we
found provides true graph-based compression: BVGraph.

Fig. 11: P-values for significant associations between columns.
(Horizontal axis: P-value, from 0 to 0.00012. Pairs shown:
structure and schema checking; integral metadata and schema
checking; hyper graph and ports; multi graph and multiple
attributes; multiple attributes and edge; default values and
visualisation data; multi graph and default values; multiple
attributes and schema checking; multi graph and ports; default
values and ports; structure and proc; edge weights and multiple
attributes; multi graph and checked.)

edge density : Edge density affects the choice of best
representation of a graph. Very sparse graphs are best
represented by edge lists, moderately sparse graphs are
(perhaps) slightly better stored as neighbour lists, and
dense graphs may be better stored as a full adjacency
matrix.
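
A back-of-envelope calculation shows why the preferred representation shifts with density. The cost model below is an assumption made for illustration (fixed-width binary identifiers, every node listed once in the neighbour list, one bit per matrix entry); real text formats will differ, but the ordering of the three regimes is the point.

```python
def storage_estimates(n, e, id_bytes=4):
    """Rough byte counts for three representations of a directed graph."""
    return {
        "edge list": 2 * e * id_bytes,          # two ids per edge
        "neighbour list": (n + e) * id_bytes,   # one id per node, one per edge
        "adjacency matrix": (n * n + 7) // 8,   # one bit per node pair
    }


def smallest(n, e):
    """Name of the representation with the smallest estimate."""
    sizes = storage_estimates(n, e)
    return min(sizes, key=sizes.get)


for e in (500, 5_000, 400_000):
    print(e, smallest(1000, e), storage_estimates(1000, e))
```

Under these assumptions the edge list only wins when there are fewer edges than nodes, the neighbour list wins over a wide middle range, and the matrix wins once a sizeable fraction of all node pairs are connected.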

access method : Most graph formats are designed to be read
serially directly into memory in their entirety. Only
BVGraph seems to provide support for random (or indexed
subgraph) access to part of a graph.

Another example of alternative access methods is that
many graph algorithms can be reduced to a generalised
matrix-vector product, and can be performed by
repeatedly streaming the edges from disk without loading the
graph into memory, which is necessary if the data is truly
large [104].

Further, formats could potentially enable reading the 
graph in parallel to exploit clustered computing [104]. 

In other cases, a single graph might be part of a larger 
database. 

In general these issues seem to have been left in the field 
of graph databases [1], and not considered for exchange 
of data. 
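
The streaming access pattern mentioned above (a matrix-vector product computed by passing over the edges on disk) can be sketched as follows. Only the two vectors, of length N, are held in memory; the edges are read one line at a time. The three-column line syntax "u v w" with integer node ids is an assumption for the example, and an in-memory text stream stands in for a file on disk.

```python
import io


def stream_edges(handle):
    """Yield (u, v, w) one line at a time; the edge list is never stored."""
    for line in handle:
        u, v, w = line.split()
        yield int(u), int(v), float(w)


def matvec(handle, x):
    """Compute y = A x, where line 'u v w' means A[u][v] = w."""
    y = [0.0] * len(x)
    for u, v, w in stream_edges(handle):
        y[u] += w * x[v]
    return y


edge_file = io.StringIO("0 1 2.0\n1 2 0.5\n2 0 1.0\n")
x = [1.0, 2.0, 3.0]
y = matvec(edge_file, x)

edge_file.seek(0)             # a second pass re-reads the same file
y2 = matvec(edge_file, y)
print(y, y2)
```

A format that suits this pattern needs only to be readable sequentially, one edge per record, which is a property of most simple edge-list formats.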

human readability : Portability requires the file to be
machine readable, but a file that is more easily understood
by humans is potentially better because it is easier to enter
and check. Many example graph datasets were entered
at least in part by hand, often through a spreadsheet or
text editor, and are maintained in the same way. In the
case of the Internet Topology Zoo [105] the data were
entered “semi-manually” through yEd (a graph editing
program).

Human readability requires a text file in a logical format,
but the file also needs (i) to avoid bloat, which distracts
the reader with unnecessary text, and (ii) to
be organised neatly. XML formats often fail on these:
the first because of the volume of tags, and the second
because they allow organisations which are unreadable,
e.g., with all the text on one line.

Ultimately, human readability is a highly subjective
criterion. Some people may find XML easy to read, and others
get distracted by the tags. As such, we won’t comment
on it further here.

maintenance : The document [106] deals with the use cases
for graphs, which we can broadly classify (in simpler
nomenclature) as

creator : originally creates the data set,

investigator : uses the data for some purpose, and

curator : refines and corrects the data.

Most current graph-exchange formats are oriented towards
creation and investigation, but not curation.

Data can easily contain errors, and correcting these ex
post facto should be supported, but most formats do not
deal with issues such as

version control : to allow, for instance, users to know
exactly which dataset was used in a particular
publication; and

diff : the ability to find semantic differences between
two files to learn what changed between the two (as
opposed to simply seeing syntactical differences).

Taking differences of arbitrary graph data is hard (it
involves solving the near-isomorphism problem), but much
graph data is labelled and in this case differences can be
found easily.
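
For labelled graphs a semantic difference reduces to set operations on edges, independent of the order in which a file happens to list them. A minimal sketch, with hypothetical node labels and an assumed convention that undirected edges are compared as unordered pairs:

```python
def graph_diff(old_edges, new_edges, directed=False):
    """Edges added and removed between two labelled graphs."""
    def canon(edges):
        # Undirected edges are compared as sorted pairs.
        return {e if directed else tuple(sorted(e)) for e in edges}

    old, new = canon(old_edges), canon(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


v1 = [("Adelaide", "Perth"), ("Adelaide", "Sydney")]
v2 = [("Sydney", "Adelaide"), ("Sydney", "Brisbane")]  # reordered and edited
print(graph_diff(v1, v2))
```

A line-based textual diff of the two files would flag every line as changed, whereas the semantic diff reports one edge added and one removed.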

documentation : Through compiling the information used in
this paper it has become obvious that a key limitation
of many formats is incomplete documentation. Hidden
assumptions, specification by (limited) examples, and/or
documentation by source code are all common. Ideally,
any truly portable format should have a complete, highly
specific schema; human-readable documentation (with
examples); and source code. All of these together provide
the ideal documentation.

support : Finally, the support for the format in a variety 
of tools is a crucial requirement for exchange of data. 
Likewise, support for formats in a variety of public 
databases makes it more useful. We shall consider this 
issue in more detail below. 

A. Software Support 

The most difficult issue surrounding software support is 
that a piece of software may notionally support a file format, 
and yet still be incompatible with other software notionally 
supporting the same format. 

For instance, software might 

• fail to accept integers outside a particular range; 

• have varying case sensitivity; 

• be unable to read the right character set; 

• be unable to read strings beyond a particular length (very 
few formats specify buffer or string lengths); or 

• fail to cope with files larger than some size. 

Size is interesting, because almost no documentation exists
for size limits for any data formats. However, it should be
reasonably obvious that if 32-bit integers are used, then the
largest number of distinct (integer) identifiers is 2^32, or
around 4 billion. In the past this was large enough that the
need to specify it may have seemed small. With today’s graphs,
this could be an important limitation.
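
The arithmetic behind this limit can be checked directly. The sketch below is illustrative only: the helper `fits_in_32_bits` is a hypothetical name, and whether a given tool uses signed or unsigned integers is an assumption that would need to be checked for each piece of software.

```python
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value: 4294967295
INT32_MAX = 2**31 - 1    # largest signed 32-bit value:   2147483647


def fits_in_32_bits(node_ids, signed=False):
    """Check whether every identifier can be stored in a 32-bit integer."""
    limit = INT32_MAX if signed else UINT32_MAX
    return all(0 <= node_id <= limit for node_id in node_ids)


print(2**32)                                          # 4294967296
print(fits_in_32_bits([0, 1, 4_294_967_295]))         # True
print(fits_in_32_bits([0, 1, 4_294_967_296]))         # False
print(fits_in_32_bits([3_000_000_000], signed=True))  # False
```

Note that a tool using signed integers runs out of identifiers at roughly 2 billion, half the unsigned limit.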

Even more pernicious is partial support for a format. Even
when documented this makes our job hard, but partial support
is not often documented. Instances include:

• hyper-graphs supported in the format, but not in software;
or

• some small number of formats make mention of allowing
complex numbers; or

• partial support for hierarchy (i.e., the file can be read, but
the subgraph structure is not retained).

Even more complex is the fact that some features may be 
supported on read or write, but not both. 

The list of potential software is long, even more so than the 
list of formats, so we won’t try to survey them here as well. 
Instead we refer readers to [2], which contains a cross-section 
of both formats and their software support. 

A common conclusion amongst those who look at this type
of data is that GraphML and Pajek are the most commonly
supported in modern systems, but they are by no means
universal or even supported by the majority of tools.

Another related issue is how hard it would be to provide 
support for a format in a new tool. This is a complex issue, 
but there are several factors that influence it. Documentation, 
as mentioned above, is a critical issue, as is the ability to use 
existing tool-sets such as those for XML or JSON. However, 
one issue hasn’t been discussed, which is the provision of an 
adequate test dataset. 

1) Test cases: It’s a tautology that implementation of a new 
graph format isn’t terribly hard, except for the hard bits. The 
point is, though, that many formats don’t tackle these. 

Many areas of difficulty are listed above. One we have not 
discussed in detail is the existence of test cases. Ideally, in 
addition to a complete specification, there should be a set 
of accompanying files providing encoded data to demonstrate 
each feature over a reasonable range of values [107]. These 
files could then be used by other developers to check their 
parser implementations. 

The concept of test cases is from software engineering 101.
However, we are not aware of a single format that provides
a truly complete set.

More often, only a small set of examples is provided, and
these don’t express all of the features of the data. For instance,
encoding, size limits, advanced features and so on are rarely
considered in these examples. Other exchange sets are used
to provide large datasets, but these too are unsuitable for
test purposes because they are large and complex, and don’t
exercise features in isolation. What is needed is a set of test
cases that exercise the features in a controlled and testable
manner.
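
To make the idea concrete, the following sketch shows what a small set of feature-isolating test cases might look like. It assumes a toy edge-list format (one whitespace-separated node pair per line, with `#` comments) invented purely for illustration; it is not one of the surveyed formats, and `parse_edge_list`, `TEST_CASES` and `run_test_cases` are hypothetical names.

```python
def parse_edge_list(text):
    """Parse a toy edge-list format: one 'source target' pair per line.

    Blank lines and lines starting with '#' are ignored. This toy format is
    an assumption made for illustration only.
    """
    edges = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 fields")
        edges.append((fields[0], fields[1]))
    return edges


# Each case isolates a single feature: (name, input text, expected edges).
TEST_CASES = [
    ("empty file", "", []),
    ("single edge", "a b\n", [("a", "b")]),
    ("comment line", "# header\na b\n", [("a", "b")]),
    ("self-loop", "a a\n", [("a", "a")]),
    ("repeated edge", "a b\na b\n", [("a", "b"), ("a", "b")]),
    ("non-ASCII label", "Zürich Genève\n", [("Zürich", "Genève")]),
    ("identifier beyond 32 bits", "4294967296 1\n", [("4294967296", "1")]),
    ("no trailing newline", "a b", [("a", "b")]),
]


def run_test_cases(parser):
    """Return the names of the cases that the given parser fails."""
    failures = []
    for name, text, expected in TEST_CASES:
        try:
            passed = parser(text) == expected
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures


print(run_test_cases(parse_edge_list))  # []
```

A developer writing a new parser for the same format could pass their own function to `run_test_cases` and see at a glance which individual features fail.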

B. Public DB support 

The other type of support we might wish to see is general 
support amongst those who provide data publicly. There are 
many public databases that provide example networks for 
benchmarking or research. We provide a list in Table VI of 
some of the better known of these with their format choices. 
Additional data sources are listed in [108], and a detailed 
taxonomy and examples of computer-network data appears in 
[109]. 

There is no clear winner here; slightly preferred is a
variant of the Trivial Graph Format due to its
least-common-denominator status (but note that this isn’t really one format, so
much as a collection of equivalent formats). Overall, however,
the formats seem to be written for the data rather than the other
way around. That, in itself, is an illustration of the problem.

| Dataset | Full name | Format |
|---|---|---|
| ARG/VF [110] | Mivia ARG Database and VF Library | Mivia |
| BioGRID [10] | Biological General Repository for Interaction Datasets | PSI MI, Osprey, BioGRID, PTMTAB |
| CASOS [111] | CMU CASOS Datasets | DyNetML, GML, UCINET, GraphML |
| ClueWeb09 [112] | ClueWeb09 Web Graph | BVGraph, TGF |
| DIMACS 10 [113] | DIMACS Implementation Challenges | DIMACS |
| DSI [114], [115] | Web Algorithmics Lab Data | BVGraph |
| Enron [116], [117] | Enron email dataset | TGF |
| Graph-Archive [118] | GraphArchive - Exchange and Archive System for Graphs | GraphML |
| GraphBench [119] | GraphBench | TGF |
| HOG [120] | The House of Graphs | TGF, Graph6, Multicode, Planar |
| HPRD [121] | Human Protein Reference Database | PSI MI, TSV |
| Hyperlink [122] | Web Data Commons - Hyperlink Graphs | Pajek, WebGraph |
| IAM [123] | IAM Graph Database Repository | GXL |
| ITDK [41] | CAIDA Macroscopic Internet Topology Data Kit | ITDK |
| Zoo [105] | Internet Topology Zoo | GML, GraphML |
| Matrix Market [51] | Matrix Market | Matrix Market |
| NAS [124] | NAS (NASA) Graph Collection | GAMFF |
| Pajek [125] | Pajek Data Sets | Pajek |
| Rocketfuel [74] | Rocketfuel | Rocketfuel |
| SGB [77], [126] | Stanford GraphBase | SGB |
| SNAP [81] | Stanford Network Analysis Platform | SNAP |
| Tore [127] | Tore Opsahl Datasets | UCINET, tnet |
| Twitter [128] | What is Twitter, a Social Network or a News Media? | TGF |
| UF [129] | The University of Florida Sparse Matrix Collection | Matrix Market, Rutherford-Boeing, Matlab |
| WF [130] | Wasserman and Faust datasets | Pajek |

TABLE VI: Public Databases. NB: there is some overlap in the data kept in these repositories.

C. Future considerations

There are many considerations or features that we could
consider. The set above was chosen for its illustrative
value, given current graph-exchange concerns. In the future,
other features could become interesting, and we
list and discuss some of these in the following. In general,
we have not tried to classify the file formats by these features,
simply because it seems that few formats support them, but
information is sparse and it is difficult to be certain in many
cases. Many of the issues cross over into issues that have
been considered in the domain of graph databases [1], and
so techniques to tackle the problems exist, but have not been
applied to the world of exchanging data.

self-describing : this refers to whether a file provides its
own definition of its format. XML arguably has this
property, but still relies on correct semantic interpretation
of arbitrary labels: for instance, a link “weight” could mean
several different things, and have any number of different
units.

data distribution : most graph formats are monolithic in that 
the entire graph is held in one file. Even those that allow 
multiple files use this to structure the type of information 
each contains, not to spread the information evenly. 

As graph data becomes larger, and the need to query
subsections of the graph grows, we need to be able to
create modularity in the graph representation. Formats
that provide the ability to distribute the graph information
over multiple (indexed) files provide a capability that
could be very useful [100].

This type of consideration, however, seems to have been 
limited primarily to graph databases [1], not exchange 
formats. 


node list : does the format have a separate node list, or is this 
list implicit in the edges? 

multi-layer : generalisations of graphs can have a layer
structure [97] (resembling in some cases hierarchy, and
in some cases temporal evolution, but more flexible than
either by itself). Multi-layer graphs can naturally be
described by adjacency tensors; however, complete
multi-layer support doesn’t yet appear in any format.

linear indexing : Another consideration in classifying network
graph formats in the future is whether they use
linear indices [131], by which we mean that if the
network has n nodes, then they are labelled 1, 2, ..., n
(equivalently we could start at 0).

Linear indices make a dataset easier to deal with at
two levels. Firstly, it is more efficient to store integers
than arbitrary strings, so both node and edge lists can
be read/written more quickly. Secondly, when the data is
read into a program, arbitrary node names need an extra
layer of indirection, such as that provided by an
associative array, and for large datasets
this can reduce performance compared to storing the data
in a simply indexed array.

The issue is primarily important for very large datasets, 
but these are becoming more common. 

Note that it doesn’t mean that nodes can’t be named: they 
can have all the usual meta-data one might associate with 
the node, but it means that the primary reference to the 
node is arithmetically simple to work with. 
In general, matrix representations have an implicit linear
indexing, but other formats are less clear about the
issue. However, some illustrative examples include SNAP,
which uses integer but not linear indices, and Matlab and
BVGraph, which both use linear indices.

One can also imagine creating simple indexes into the
edges, but this goes a step beyond any exchange format’s
goals so far.
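
The trade-off between named nodes and linear indices can be sketched as follows. This is a minimal illustration under our own assumptions: the function `to_linear_indices` and the sample node names are hypothetical, and indices start at 0 rather than 1. The associative array is needed only once, at conversion time; afterwards per-node data sits in a plain array, while the names remain available as meta-data.

```python
def to_linear_indices(edges):
    """Relabel arbitrary node names as 0..n-1, in order of first appearance."""
    index_of = {}        # associative array: name -> linear index
    names = []           # inverse map: linear index -> name
    indexed_edges = []
    for u, v in edges:
        for name in (u, v):
            if name not in index_of:
                index_of[name] = len(names)
                names.append(name)
        indexed_edges.append((index_of[u], index_of[v]))
    return names, indexed_edges


edges = [("adelaide", "sydney"), ("sydney", "perth"), ("adelaide", "perth")]
names, indexed = to_linear_indices(edges)

# With linear indices, per-node data lives in a simply indexed array.
degree = [0] * len(names)
for u, v in indexed:
    degree[u] += 1
    degree[v] += 1

print(names)    # ['adelaide', 'sydney', 'perth']
print(indexed)  # [(0, 1), (1, 2), (0, 2)]
print(degree)   # [2, 2, 2]
```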

serialisation : Many graph data formats are designed to be
read into memory in their entirety. They do not support 
the ability to read through the data serially, and perform 
analysis on the fly. 
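
By way of contrast, a simple edge list can be processed serially. The sketch below assumes a toy format with one whitespace-separated node pair per line (an assumption for illustration, not a surveyed format) and computes node degrees in a single pass. The name `stream_degrees` is hypothetical.

```python
import io
from collections import Counter


def stream_degrees(lines):
    """Accumulate node degrees from an iterable of 'source target' lines."""
    degree = Counter()
    for line in lines:       # one line at a time; the edge list is never stored
        fields = line.split()
        if len(fields) != 2:
            continue         # skip blank or malformed lines in this sketch
        u, v = fields
        degree[u] += 1
        degree[v] += 1
    return degree


# io.StringIO stands in for an open file; a file object iterates the same way.
data = io.StringIO("a b\nb c\nc a\nc d\n")
print(stream_degrees(data).most_common(1))  # [('c', 3)]
```

Memory use here grows with the number of nodes rather than the number of edges. Formats that nest edges inside a single enclosing structure typically need an event-driven parser to achieve the same effect.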

random access and/or queries : As noted, most graph data 
formats are designed to be read into memory in their 
entirety. 

But an even bigger limitation, even of those that can be
read serially, is that they do not support the ability to find
information about an arbitrary link or node (or subset of
such) without reading the whole data set (at least through
to the relevant point).

Again, this is only a problem for very large datasets, but
clearly is a huge issue for such sets, not least because it is
easy to imagine datasets too large to be read into realistic
RAM, but also because reading everything is hopelessly
inefficient for certain types of analysis.

Again, graph databases deal with this issue, but exchange 
formats have not, so far. 
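
One way such random access could work is sketched below, purely as an illustration and not as a description of any existing format. Edges are written as fixed-width binary records sorted by source node, and a small index records where each node’s records start. The record layout, and the names `write_indexed` and `neighbours`, are our own assumptions; the index is kept in memory here, whereas a real format would have to store it alongside the data.

```python
import io
import struct

RECORD = struct.Struct("<II")  # one edge: two little-endian unsigned 32-bit integers


def write_indexed(edges, stream):
    """Write edges sorted by source; return index: source -> (byte offset, count)."""
    index = {}
    for u, v in sorted(edges):
        offset, count = index.get(u, (stream.tell(), 0))
        index[u] = (offset, count + 1)
        stream.write(RECORD.pack(u, v))
    return index


def neighbours(stream, index, u):
    """Read only the records for node u, using the index to seek directly."""
    if u not in index:
        return []
    offset, count = index[u]
    stream.seek(offset)
    data = stream.read(count * RECORD.size)
    return [v for _, v in RECORD.iter_unpack(data)]


stream = io.BytesIO()          # stands in for a binary file on disk
index = write_indexed([(2, 0), (0, 1), (0, 2), (1, 2)], stream)
print(index)                          # {0: (0, 2), 1: (16, 1), 2: (24, 1)}
print(neighbours(stream, index, 0))   # [1, 2]
```

Only the bytes belonging to the requested node are read, so the cost of a query does not depend on the size of the file.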

parallel read/write : The monolithic nature of most graph
exchange formats makes them unsuitable for parallel writing.
It is hard to separate parts of a graph and write them
independently.

The fact that it is assumed that most files will be read
in their entirety also limits the ability to parallelise read
operations.

Again, graph databases attack this problem, whereas
exchange formats have not.

D. Discussion 

The point of all this is to ask what should be done here, and how
one should proceed. There are three major considerations:

• what representation of a graph (or generalised graph) will 
be used: edge or neighbour list, adjacency matrix, paths, 
or some constructive or procedural approach; 

• what additional information is to be added, and how 
flexible this information should be; and 

• what encapsulation of the data is to be used (XML and 
more recently JSON seem to be favourites). 

Then there is a substantial set of other features and aspects of
the dataset that should be considered in the choice of formats.
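
To illustrate the first consideration, the sketch below converts one small undirected graph from an edge list into a neighbour list and an adjacency matrix. It assumes linearly indexed nodes starting at 0, and the function names are hypothetical. Note that the node count must be supplied separately: node 3 is isolated, so it cannot be inferred from the edges alone, which is why the presence of an explicit node list matters.

```python
def to_neighbour_list(n, edges):
    """Neighbour (adjacency) list of an undirected graph on nodes 0..n-1."""
    neighbours = [[] for _ in range(n)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    return neighbours


def to_adjacency_matrix(n, edges):
    """Dense 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


n = 4                             # node 3 is isolated, so it appears in no edge
edges = [(0, 1), (0, 2), (1, 2)]
print(to_neighbour_list(n, edges))
# [[1, 2], [0, 2], [0, 1], []]
print(to_adjacency_matrix(n, edges))
# [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
```

The edge list grows with the number of edges, whereas the dense matrix always needs n × n entries, which is one reason the choice of representation matters for large, sparse graphs.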

VII. Conclusion 

The science of graphs and networks needs portable,
well-documented, precisely-defined, exchange formats. There are
many existing formats, and this paper seeks to unravel this
mess, most notably with the aim of reducing the number of
new formats developed.


One size probably does not fit all though. There is a clear 
need for at least three major types of file format: 

• a general, flexible, extensible approach such as GraphML; 

• a quick and dirty approach that satisfies the least common 
denominator for the exchange of information to/from the 
simplest software; and 

• a very efficient (compressed) format for very large graphs. 

It’s not clear that any format at present has a complete
enough list of features to take the role of the first format.
No doubt this will continue to evolve as well, as new features
are required. Moreover, the requirement for human readability of
the data is evolving as more datasets are generated through
automated means rather than entered by hand.

The second is easy, but there are very many contenders, and 
settling on one will be hard. 

The final one should be seen as an interesting research topic
given there are multiple compression techniques available.
However, the only true example of a compressive format,
BVGraph, does not allow attributes, and so some thought might
be devoted to that topic.

Finally, although having arbitrarily extendable attributes for
the graph and its components seems an attractive feature, it
is easy to see why specialised applications would prefer a
pre-defined list, most obviously to make support for those
attributes easier (both in terms of parsing, and in terms of
exchange). However, there is also the subtle issue of what
attributes could be included vs those that should be included.
Explicit definition of the required attributes can create a better
overall set of data by forcing the lowest-common-denominator
to be higher.

In the end, maybe what is needed is actually a container
format: one allowing specification of parts of a graph in alternative
formats, or allowing specification of meta-data and labels in
an XML-like format, but the edge data in a more compact
form.

Alternatively, good conversion programs could simplify the 
issue, but at present most software tools are not designed with 
this in mind (for instance, such a tool needs to be lightweight, 
but warn about different available features, and support a large 
range of possibilities). 